[fix](parquet-reader) Fixed the issue of excessive scanning data in late materialization‌ case of parquet reader #46121

Merged

@kaka11chen kaka11chen commented Dec 27, 2024

What problem does this PR solve?

Related PR: #40641

Problem Summary:

Fixes excessive data scanning in the late materialization path of the parquet reader, introduced by #40641, in scenarios with particularly high filtering rates.
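For readers unfamiliar with the term, the behaviour this fix is meant to restore can be sketched as follows. This is a simplified illustration of late materialization in general, not the Doris implementation: `ColumnChunk`, `ScanStats`, `read_column`, and `scan_row_group` are hypothetical names, and a row group is modelled as a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for one column's data within a row group."""
    values: List[object]
    size_bytes: int


@dataclass
class ScanStats:
    """Tracks what the sketch 'reads', similar in spirit to a query profile."""
    bytes_read: int = 0
    columns_read: List[str] = field(default_factory=list)


def read_column(row_group: Dict[str, ColumnChunk], name: str, stats: ScanStats) -> List[object]:
    """Simulate reading one column and account for its bytes."""
    chunk = row_group[name]
    stats.bytes_read += chunk.size_bytes
    stats.columns_read.append(name)
    return chunk.values


def scan_row_group(
    row_group: Dict[str, ColumnChunk],
    predicate_column: str,
    predicate: Callable[[object], bool],
    stats: ScanStats,
) -> List[Dict[str, object]]:
    # Phase 1: read only the predicate column and evaluate the filter.
    pred_values = read_column(row_group, predicate_column, stats)
    selected = [i for i, value in enumerate(pred_values) if predicate(value)]

    # If nothing survives the filter, the remaining columns are never read.
    if not selected:
        return []

    # Phase 2: materialize the other (lazy) columns for surviving rows only.
    lazy = {
        name: read_column(row_group, name, stats)
        for name in row_group
        if name != predicate_column
    }
    rows = []
    for i in selected:
        row = {predicate_column: pred_values[i]}
        for name, values in lazy.items():
            row[name] = values[i]
        rows.append(row)
    return rows


# Example: a filter that rejects every row should touch only column x.
row_group = {
    "x": ColumnChunk(values=["a", "b", "c"], size_bytes=100),
    "y": ColumnChunk(values=[1, 2, 3], size_bytes=1000),
    "z": ColumnChunk(values=[1.0, 2.0, 3.0], size_bytes=5000),
}
stats = ScanStats()
rows = scan_row_group(row_group, "x", lambda v: v == "xxx", stats)
print(rows, stats.bytes_read, stats.columns_read)
```

With a predicate that matches nothing, only the bytes of `x` are counted, which is the property the manual test in the checklist looks for in the profile.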

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    1. Enable the profile and run the SQL:
    set enable_profile=true;
    select * from parquet_table where x like 'xxx'; -- this query returns no rows

    2. Check the file bytes reported in the profile; only column x's bytes should have been scanned.
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@kaka11chen
Contributor Author

run buildall

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@doris-robot

TPC-H: Total hot run time: 32627 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0796ceccfe226972cbff6afc4baf5a5e8536bd31, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17591	6139	6042	6042
q2	2056	312	164	164
q3	10409	1222	758	758
q4	10231	876	446	446
q5	7875	2182	2011	2011
q6	202	182	147	147
q7	926	765	614	614
q8	9246	1383	1197	1197
q9	5251	4867	4892	4867
q10	6766	2317	1872	1872
q11	489	282	265	265
q12	351	357	225	225
q13	17780	3553	3020	3020
q14	232	224	224	224
q15	572	504	500	500
q16	660	615	578	578
q17	594	856	337	337
q18	6877	6361	6407	6361
q19	2115	972	540	540
q20	299	319	189	189
q21	2933	2286	1963	1963
q22	369	329	307	307
Total cold run time: 103824 ms
Total hot run time: 32627 ms
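
The two totals can be reproduced from the rows themselves. The sketch below is inferred from the numbers in this comment, not taken from the tpch-tools scripts: it assumes the first timing is the cold run and that the last column is the faster of the two subsequent runs. `ROUND_1` and `summarize` are names made up for this illustration.

```python
# Round 1 rows copied from the table above: query name, three run times in ms,
# and the value reported in the last column.
ROUND_1 = """
q1 17591 6139 6042 6042
q2 2056 312 164 164
q3 10409 1222 758 758
q4 10231 876 446 446
q5 7875 2182 2011 2011
q6 202 182 147 147
q7 926 765 614 614
q8 9246 1383 1197 1197
q9 5251 4867 4892 4867
q10 6766 2317 1872 1872
q11 489 282 265 265
q12 351 357 225 225
q13 17780 3553 3020 3020
q14 232 224 224 224
q15 572 504 500 500
q16 660 615 578 578
q17 594 856 337 337
q18 6877 6361 6407 6361
q19 2115 972 540 540
q20 299 319 189 189
q21 2933 2286 1963 1963
q22 369 329 307 307
"""


def summarize(table: str) -> tuple:
    """Return (cold_total_ms, hot_total_ms) for a whitespace-separated table."""
    cold_total = 0
    hot_total = 0
    for line in table.strip().splitlines():
        name, first, second, third, best = line.split()
        # Assumption: the last column is the faster of the two warm runs.
        assert int(best) == min(int(second), int(third)), name
        cold_total += int(first)  # first run is treated as the cold run
        hot_total += int(best)
    return cold_total, hot_total


cold_ms, hot_ms = summarize(ROUND_1)
print(f"Total cold run time: {cold_ms} ms")
print(f"Total hot run time: {hot_ms} ms")
```

Under these assumptions the sums match the totals reported for Round 1.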

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	6335	6262	6247	6247
q2	239	323	242	242
q3	2247	2630	2310	2310
q4	1365	1806	1341	1341
q5	4328	4728	4809	4728
q6	186	178	147	147
q7	2090	1931	1796	1796
q8	2671	2834	2681	2681
q9	7386	7239	7332	7239
q10	3098	3320	2876	2876
q11	594	521	488	488
q12	647	716	606	606
q13	3429	3840	3135	3135
q14	279	287	288	287
q15	571	511	513	511
q16	664	698	648	648
q17	1234	1767	1260	1260
q18	7537	7393	7285	7285
q19	858	1142	1203	1142
q20	2014	2059	1901	1901
q21	5729	5348	4844	4844
q22	606	583	562	562
Total cold run time: 54107 ms
Total hot run time: 52276 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.89% (10119/26022)
Line Coverage: 29.88% (85521/286184)
Region Coverage: 29.02% (43704/150614)
Branch Coverage: 25.55% (22295/87256)
Coverage Report: http://coverage.selectdb-in.cc/coverage/0796ceccfe226972cbff6afc4baf5a5e8536bd31_0796ceccfe226972cbff6afc4baf5a5e8536bd31/report/index.html

@doris-robot

TPC-DS: Total hot run time: 197633 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0796ceccfe226972cbff6afc4baf5a5e8536bd31, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1300	952	918	918
query2	6354	2338	2324	2324
query3	10976	4718	4791	4718
query4	33207	23708	23426	23426
query5	4379	634	450	450
query6	264	190	186	186
query7	3998	489	298	298
query8	287	231	223	223
query9	9474	2732	2739	2732
query10	490	312	251	251
query11	17979	15517	15268	15268
query12	167	106	111	106
query13	1639	568	430	430
query14	11289	7758	7029	7029
query15	260	214	190	190
query16	8691	647	492	492
query17	1553	787	593	593
query18	2134	415	316	316
query19	216	195	172	172
query20	132	120	119	119
query21	213	127	106	106
query22	4535	4551	4513	4513
query23	34313	33677	33744	33677
query24	6282	2307	2288	2288
query25	500	459	402	402
query26	809	286	150	150
query27	2044	473	337	337
query28	5670	2512	2442	2442
query29	679	593	438	438
query30	205	177	146	146
query31	977	952	849	849
query32	88	59	56	56
query33	487	374	311	311
query34	761	883	534	534
query35	830	842	798	798
query36	1030	1067	983	983
query37	120	98	72	72
query38	4129	4266	4138	4138
query39	1533	1473	1501	1473
query40	211	117	104	104
query41	44	43	45	43
query42	125	103	99	99
query43	525	537	506	506
query44	1403	862	830	830
query45	182	175	187	175
query46	880	1069	661	661
query47	2020	2022	1936	1936
query48	392	425	373	373
query49	707	465	378	378
query50	640	667	397	397
query51	7329	7358	7267	7267
query52	105	98	95	95
query53	220	254	180	180
query54	475	489	418	418
query55	88	83	80	80
query56	255	279	237	237
query57	1285	1269	1180	1180
query58	240	222	230	222
query59	3206	3278	3281	3278
query60	276	269	251	251
query61	106	104	107	104
query62	865	821	711	711
query63	229	194	194	194
query64	3295	1047	675	675
query65	3354	3347	3278	3278
query66	775	404	308	308
query67	16551	15847	15482	15482
query68	9700	753	509	509
query69	474	291	256	256
query70	1235	1119	1151	1119
query71	446	288	245	245
query72	6271	3935	3862	3862
query73	671	754	360	360
query74	10145	9017	8901	8901
query75	4582	3172	2668	2668
query76	4643	1194	767	767
query77	858	347	276	276
query78	10189	10175	10073	10073
query79	4965	869	582	582
query80	693	515	430	430
query81	501	264	225	225
query82	247	150	120	120
query83	189	160	145	145
query84	291	85	72	72
query85	763	364	301	301
query86	353	309	298	298
query87	4465	4439	4470	4439
query88	3533	2237	2208	2208
query89	408	329	282	282
query90	2121	189	182	182
query91	140	134	106	106
query92	69	54	51	51
query93	2116	857	525	525
query94	651	452	279	279
query95	325	257	256	256
query96	488	617	288	288
query97	2776	2854	2701	2701
query98	217	201	193	193
query99	1713	1548	1498	1498
Total cold run time: 302015 ms
Total hot run time: 197633 ms

@doris-robot

ClickBench: Total hot run time: 31.37 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 0796ceccfe226972cbff6afc4baf5a5e8536bd31, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.03	0.03
query2	0.07	0.04	0.03
query3	0.24	0.07	0.07
query4	1.61	0.11	0.10
query5	0.42	0.42	0.43
query6	1.18	0.65	0.65
query7	0.02	0.01	0.02
query8	0.04	0.03	0.03
query9	0.61	0.51	0.51
query10	0.55	0.57	0.56
query11	0.14	0.11	0.10
query12	0.14	0.11	0.11
query13	0.62	0.60	0.61
query14	2.87	2.90	2.72
query15	0.90	0.83	0.82
query16	0.38	0.39	0.37
query17	1.02	1.02	1.04
query18	0.22	0.21	0.20
query19	1.92	1.80	1.97
query20	0.02	0.01	0.01
query21	15.37	0.92	0.56
query22	0.75	0.94	0.65
query23	15.14	1.47	0.57
query24	3.34	1.11	1.13
query25	0.21	0.26	0.06
query26	0.28	0.14	0.13
query27	0.05	0.08	0.04
query28	14.03	1.53	1.04
query29	12.59	4.03	3.33
query30	0.25	0.09	0.06
query31	2.84	0.60	0.38
query32	3.23	0.55	0.46
query33	3.18	3.12	3.10
query34	16.52	5.15	4.46
query35	4.50	4.48	4.47
query36	0.62	0.52	0.48
query37	0.10	0.06	0.06
query38	0.04	0.04	0.03
query39	0.04	0.02	0.02
query40	0.16	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.39 s
Total hot run time: 31.37 s

@kaka11chen kaka11chen marked this pull request as ready for review December 30, 2024 02:03
@kaka11chen kaka11chen changed the title from "[Fix](parquet-reader) fix parquet late materialization by resolve scanner starve problem." to "[Fix] (parquet-reader) Fixed the issue of excessive scanning data in late materialization case of parquet reader introduced by #40641 in scenarios with particularly high filtering rates." Dec 30, 2024
@morningman morningman changed the title from "[Fix] (parquet-reader) Fixed the issue of excessive scanning data in late materialization case of parquet reader introduced by #40641 in scenarios with particularly high filtering rates." to "[Fix] (parquet-reader) Fixed the issue of excessive scanning data in late materialization case of parquet reader" Dec 30, 2024
@morningman morningman changed the title from "[Fix] (parquet-reader) Fixed the issue of excessive scanning data in late materialization case of parquet reader" to "[fix](parquet-reader) Fixed the issue of excessive scanning data in late materialization case of parquet reader" Dec 30, 2024
Contributor

@morningman morningman left a comment

LGTM

Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved and reviewed labels Dec 30, 2024
Contributor

PR approved by anyone and no changes requested.

Contributor

@hubgeter hubgeter left a comment

LGTM

@morningman morningman merged commit 0348b33 into apache:master Dec 30, 2024
28 of 32 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 30, 2024
…ate materialization‌ case of parquet reader (#46121)

### What problem does this PR solve?

Related PR: #40641

Problem Summary:

[Fix](parquet-reader) Fixed the issue of excessive scanning data in late
materialization‌ case of parquet reader introduced by #40641 in
scenarios with particularly high filtering rates.
morningman pushed a commit that referenced this pull request Dec 30, 2024
…ng data in late materialization‌ case of parquet reader #46121 (#46183)

Cherry-picked from #46121

Co-authored-by: Qi Chen <[email protected]>
morningman pushed a commit that referenced this pull request Dec 31, 2024
…ng data in late materialization‌ case of parquet reader #46121 (#46182)

Cherry-picked from #46121

Co-authored-by: Qi Chen <[email protected]>
Labels
approved, dev/2.1.8-merged, dev/3.0.4-merged, reviewed